Conversation

Contributor

@bernhardmgruber bernhardmgruber commented Dec 8, 2025

Fixes: #6919
Fixes: #5057
Fixes: #3017

Compile time of cub.test.device.transform.lid_0 using nvcc 13.0 and clang 20 for sm86, sm120

branch:
1m49.900s
1m50.615s
1m50.255s

main:
1m56.917s
1m57.378s
1m59.371s

Compile time of cub.test.device.transform.lid_0 for sm86, sm120 using clang 20 in CUDA mode:

branch:
real 1m40.627s
real 1m40.675s
real 1m40.912s

main:
real 1m39.273s
real 1m39.669s
real 1m39.835s

Contributor

copy-pr-bot bot commented Dec 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 8, 2025
@bernhardmgruber bernhardmgruber marked this pull request as ready for review December 9, 2025 07:44
@bernhardmgruber bernhardmgruber requested review from a team as code owners December 9, 2025 07:44
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Dec 9, 2025

Comment on lines 986 to 1014
#if _CCCL_HAS_CONCEPTS()
requires transform_policy_hub<ArchPolicies>
#endif // _CCCL_HAS_CONCEPTS()
Contributor

Nitpick: I believe we should either use the concept emulation or plain SFINAE in C++17 too

Contributor Author

Hmm. We could also static_assert, but ArchPolicies is already used in the kernel attributes before we reach the body, and a static_assert would only be evaluated in the device path.

How would I write that using concept emulation and have the concept check before the __launch_bounds__?

Contributor

We could write:

Suggested change
#if _CCCL_HAS_CONCEPTS()
requires transform_policy_hub<ArchPolicies>
#endif // _CCCL_HAS_CONCEPTS()
_CCCL_TEMPLATE(typename PolicySelector,
typename Offset,
typename Predicate,
typename F,
typename RandomAccessIteratorOut,
typename... RandomAccessIteratorsIn)
_CCCL_REQUIRES(transform_policy_selector<PolicySelector>)
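For reference, a minimal sketch of what such concept-emulation macros typically expand to; the names below are illustrative only, not the actual `_CCCL_TEMPLATE`/`_CCCL_REQUIRES` expansion:

```cpp
#include <type_traits>

// Illustrative stand-in for a policy-hub-style constraint.
template <typename T>
inline constexpr bool is_valid_policy_v = std::is_trivially_copyable_v<T>;

#if defined(__cpp_concepts)
// C++20: the requires-clause constrains the template right after the
// template head, before any function attributes or the declarator.
template <typename T>
  requires is_valid_policy_v<T>
int select_policy(T)
{
  return 1;
}
#else
// C++17 emulation: SFINAE via a defaulted enable_if_t template parameter,
// which also sits in the template head, ahead of the declarator.
template <typename T, std::enable_if_t<is_valid_policy_v<T>, int> = 0>
int select_policy(T)
{
  return 1;
}
#endif
```

Either form rejects an invalid policy type at the template head, which is what allows the check to happen before `__launch_bounds__` is instantiated.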

Contributor Author

Yeah, but as discussed on Slack before, we would need to get transform_policy_selector and then policy_selector working, which we couldn't because of the is_constant_expression check. Let's leave it.

Comment on lines +358 to +380
bool all_inputs_contiguous = true;
bool all_input_values_trivially_reloc = true;
bool can_memcpy_contiguous_inputs = true;
bool all_value_types_have_power_of_two_size = ::cuda::is_power_of_two(output.value_type_size);
for (const auto& input : inputs)
{
all_inputs_contiguous &= input.is_contiguous;
all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable;
// the vectorized kernel supports mixing contiguous and non-contiguous iterators
can_memcpy_contiguous_inputs &= !input.is_contiguous || input.value_type_is_trivially_relocatable;
all_value_types_have_power_of_two_size &= ::cuda::is_power_of_two(input.value_type_size);
}
Contributor

Nitpick: While the explicit loop is technically more efficient, I believe it would improve readability if we did:

    const bool all_inputs_contiguous = ::cuda::std::all_of(inputs.begin(), inputs.end(), [](const auto& input) { return input.is_contiguous; });

Contributor Author

Can I do this later? Maybe we'll have std::ranges::all_of by then.
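For reference, a standalone sketch of the all_of-style rewrite; `iterator_state` and the helper functions here are hypothetical stand-ins for the real CUB types, and `is_power_of_two` is assumed to use the usual bit trick:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-input descriptor, mirroring the fields used in the loop above.
struct iterator_state
{
  bool is_contiguous;
  bool value_type_is_trivially_relocatable;
  std::size_t value_type_size;
};

// The classic bit trick; assumed to match what ::cuda::is_power_of_two checks.
constexpr bool is_power_of_two(std::size_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

inline bool all_inputs_contiguous(const std::vector<iterator_state>& inputs)
{
  return std::all_of(inputs.begin(), inputs.end(),
                     [](const iterator_state& i) { return i.is_contiguous; });
}

inline bool can_memcpy_contiguous_inputs(const std::vector<iterator_state>& inputs)
{
  // A contiguous input must be trivially relocatable to be memcpy'd;
  // non-contiguous inputs are loaded through their iterator instead.
  return std::all_of(inputs.begin(), inputs.end(), [](const iterator_state& i) {
    return !i.is_contiguous || i.value_type_is_trivially_relocatable;
  });
}
```

Each flag then becomes a single named `const bool` instead of one accumulation per loop iteration.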


Contributor Author

bernhardmgruber commented Dec 11, 2025

I see tiny changes in the generated SASS for cub.bench.transform.babelstream.base, notably in the filling kernels (no inputs) for complex<float>. The compiler now generates STG.E.ENL2.256, which it didn't do before.

The fill kernel for int128 seems to have degraded from generating STG.E.128 to a lot more STG.E instructions.

All kernels with a functor marked as __callable_permitting_copied_arguments show no changes. That's good.

It feels a bit like the items per thread changed for the fill kernels.

@bernhardmgruber
Contributor Author

It feels a bit like the items per thread changed for the fill kernels.

They did. Before, there was a tuning policy for sm_120 that was not taken into account :D This PR now uses it.

@bernhardmgruber
Contributor Author

I disabled the sm120 fill policy and now the only SASS diff for filling is on:

void cub::_V_300300_SM_1200::detail::transform::transform_kernel<cub::_V_300300_SM_1200::detail::transform::policy_hub<false, true, cuda::std::__4::tuple<cuda::__4::counting_iterator<long, 0, 0>>, unsigned long*>::policy1000, long, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cuda::__4::counting_iterator<long, 0, 0>>(long, int, bool, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cub::_V_300300_SM_1200::detail::transform::kernel_arg<cuda::__4::counting_iterator<long, 0, 0>>)

which is a thrust::tabulate of a counting_iterator<long> and an unsigned long*.

@gonidelis gonidelis self-requested a review December 11, 2025 16:44
@bernhardmgruber
Contributor Author

Found the final issue with the fill kernels. Disabled the vectorized tunings when we have input streams (they were tuned for output-only use cases). SASS of cub.bench.transform.fill.base now matches the baseline on sm120.

Collaborator

@gevtushenko gevtushenko left a comment

Excited to see the new tuning machinery at work! Code is much more readable now and we no longer have to parse PTX 🎉


Contributor Author

bernhardmgruber commented Jan 15, 2026

I see EVO correctly failing to build invalid parameter configurations:

2026-01-15 12:55:46,259: starting build for cub.bench.transform.babelstream.bif_-8.alg_3.tpb_384.vsp2_2.vpt_2: cmake --build . --target cub.bench.transform.babelstream.variant
2026-01-15 12:55:52,387: finished build for cub.bench.transform.babelstream.bif_-8.alg_3.tpb_384.vsp2_2.vpt_2 (exit code: 2) in 6.128s
2026-01-15 12:55:53,348: found cached base build for cub.bench.transform.babelstream.base
2026-01-15 12:55:53,352: found cached base build for cub.bench.transform.babelstream.base
2026-01-15 12:55:58,146: starting build for cub.bench.transform.babelstream.bif_-16.alg_0.tpb_512.vsp2_4.vpt_3: cmake --build . --target cub.bench.transform.babelstream.variant
2026-01-15 12:56:05,124: finished build for cub.bench.transform.babelstream.bif_-16.alg_0.tpb_512.vsp2_4.vpt_3 (exit code: 2) in 6.978s
2026-01-15 12:56:06,090: found cached base build for cub.bench.transform.babelstream.base
2026-01-15 12:56:06,094: found cached base build for cub.bench.transform.babelstream.base
2026-01-15 12:56:07,406: starting build for cub.bench.transform.babelstream.bif_0.alg_2.tpb_128.vsp2_2.vpt_1: cmake --build . --target cub.bench.transform.babelstream.variant
2026-01-15 12:56:14,887: finished build for cub.bench.transform.babelstream.bif_0.alg_2.tpb_128.vsp2_2.vpt_1 (exit code: 2) in 7.481s
2026-01-15 12:56:47,128: found cached base build for cub.bench.transform.babelstream.base
2026-01-15 12:56:47,134: found cached base build for cub.bench.transform.babelstream.base
2026-01-15 12:56:55,717: starting build for cub.bench.transform.babelstream.bif_-4.alg_1.tpb_512.vsp2_5.vpt_1: cmake --build . --target cub.bench.transform.babelstream.variant
2026-01-15 12:57:23,938: finished build for cub.bench.transform.babelstream.bif_-4.alg_1.tpb_512.vsp2_5.vpt_1 (exit code: 0) in 28.221s

What's a bit worrying is that over the course of 4h and across 8 GPUs, no algorithm variant other than alg_1 succeeded in building. This may suggest we need a different approach, but we can address this later.


@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 7h 49m: Pass: 100%/133 | Total: 1d 20h | Max: 4h 29m | Hits: 98%/177904


@bernhardmgruber bernhardmgruber merged commit 6e592be into NVIDIA:main Jan 15, 2026
282 of 285 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Jan 15, 2026
@bernhardmgruber bernhardmgruber deleted the tuning_transform branch January 15, 2026 22:46
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this pull request Jan 15, 2026
PR NVIDIA#6914 seems to have missed a few pieces to remove
bernhardmgruber added a commit that referenced this pull request Jan 16, 2026
PR #6914 seems to have missed a few pieces to remove
Development

Successfully merging this pull request may close these issues:

- Implement the new tuning API for DeviceTransform
- Make cub::DeviceTransform tunable
- Expose cub::DeviceTransform BIF to tuning